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Abstract 

The kernel score statistic is a global covariance component test over a set of genetic markers. It provides a flexible 
modeling framework and does not collapse marker information. We generalize the kernel score statistic to allow 
for familial dependencies and to adjust for random confounder effects. With this extension, we adjust our analysis 
of real and simulated baseline systolic blood pressure for polygenic familial background. We find that the kernel 
score test gains appreciably in power through the use of sequencing compared to tag-single-nucleotide 
polymorphisms for very rare single nucleotide polymorphisms with <1% minor allele frequency. 



Background 

Lately, much interest has been focused on global multilo- 
cus procedures that test for overall association of sets of 
markers. This is appealing if the loci carry rare variants 
or are thought to belong to a specific gene set, such as a 
pathway or genes with similar functions. The kernel 
score statistic is a specific type of such a global covariate- 
adjusted multilocus association test. Depending on con- 
text, it has received different names, such as SKAT 
(sequence kernel association test) [1]. Kernel methods 
provide flexible semiparametric regression models of 
multimarker genetic influence on expected trait outcome 
and can conveniently be implemented within the linear 
mixed-model formalism [2,3]. The kernel expresses 
genetic correlation between subjects. It allows us to test a 
whole model class by defining a prior probability in func- 
tion space in a bayesian manner [4] . In contrast, standard 
regression models are limited to just 1 a priori stated 
submodel of such a class. 

Until very recently, kernel methods were only applied 
to independent data (see Ref. [5] for a review). An exten- 
sion to include familial dependence was made for trait 
prediction [6], but not for testing genetic influence. This 
changed with independent Genetic Analysis Workshop 
18 (GAW18) contributions by ourselves and others 
(Huang et al [7], Chen et al [8], Dufresne et al [9], and 
Schifano et al [10]). Huang et al [7] applied the kernel 
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method to trios. The other 4 contributions (including 
ours) use a similar kernel extension to families with dif- 
ferent study aims. We generalize the kernel score statistic 
to adjust for dependency in metric trait data. We apply it 
to real and simulated GAW18 baseline systolic blood 
pressure (SBP) and contrast its performance on the 
sequenced panel in comparison to tag-single-nucleotide 
polymorphisms (SNPs) of a genome-wide association 
study (GWAS). 

Methods 

Data 

GAW18 provided blood pressure trait data from Mexican 
Americans participating in the San Antonio Family Heart 
Study or the San Antonio Family Diabetes/Gallbladder 
Study, and in the Type 2 Diabetes Genetic Exploration by 
Next-generation sequencing in Ethnic Samples (T2D- 
GENES) Project 2 study [11]. The sample is enriched for 
type 2 diabetes and for rare variants. Complete dense 
SNP allele dosage data were provided for 959 subjects 
from 20 large pedigrees with 22 to 86 members. Data are 
based on genome-wide tag-SNP genotyping of all sub- 
jects, whole-genome sequencing of 464 selected subjects, 
and imputation of all missing dosages for the remaining 
SNPs and subjects, exploiting kinship within pedigrees 
[11]. We considered only subjects with known baseline 
SBP, sex, and age, who were not on blood pressure medi- 
cation: 706 subjects with real SBP, excluding the first 
listed twin of monozygotic twin pairs and 740 to 781 sub- 
jects with simulated SBP (subject numbers vary for 200 
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simulated study replicates as a result of inclusion cri- 
teria). We tested candidate genes on chromosome 3 only 
(real SBP: 5 genes selected based on previous association 
reports [12], simulated SBP: gene MAP4). From group 
discussions, we knew that MAP4 associates with simu- 
lated SBP and that simulated trait Ql mimics the segre- 
gation of genes within families, but otherwise represents 
the genetic null hypothesis. 

Kernel score statistic with adjustment for random 
confounders 

We analysed baseline SBP, which has a right-skewed 
distribution toward large extremes. This was amended 
by a priori rank-normalization (Blom transformation 
[13]) of SBP to a standard normally distributed target 
variable Y s = F _1 ((r r 3/8)/(n obs +l/4)). F" 1 denotes the 
standard normal distribution quantile; n obs the number 
of observed SBP values; and r; their respective rank. 
Individual traits Yi for i = 1, n subjects were modeled 
to depend on fixed effects b of covariates X, (intercept, 
age, sex, age x sex interaction), on random effects c for 
shared familial polygenic background, and on a semi- 
parametric model h(G) of a genetic marker set G. 

Y ~ Xib T + Z iC T + h(G) (1) 

Vectors X; and Z ; represent the subject rows of 
respective design matrices X, Z for fixed covariate 
effects and random family effects. The semiparametric 
model h(G) can be written as h=Ka T with n x n dimen- 
sional kernel matrix K and multivariate normally distrib- 
uted random effects estimates a~iV(0,xK) [2]. The kernel 
score statistic tests for a genetic covariance component 
t with null hypothesis H 0 : t = 0. Most conveniently, it 
can be computed based on restricted maximum likeli- 
hood parameter estimates of the genetic null model 

Y = Xb^ + Zc T 0 + e, (2) 

which estimates only fixed covariate effects b Q , ran- 
dom pedigree effects c Q , polygenic component variance 
s 2 fam and residual variance s 2 , e~AT(0,s 2 I). Extending the 
kernel score statistic to adjust for familial dependencies, 
we obtain test statistic 

Q = (Y - Xb^V-'KV-^Y _ xb T )/2 (3) 

with matrix V 0 = s 2 I+s 2 fam ZZ T . With fixed effects esti- 
mate b 0 =(X T V 0 _1 X) _1 X T V 0 _1 Y, we rewrite test statistic 
(3) as quadratic form Q=R T MR of standard normally 
distributed residuals R= P 0 1/2 Y with matrix M=(P 0 1/2 K 
P D 1/2 )/2 and null projection matrix P 0 =V 0 _1 - V 0 _1 X 
(X T V 0 " 1 X) _1 X T V 0 " 1 . The p values for test statistic (3) 
can be calculated by the Davies exact method [14] with 
the R package CompQuadForm from sample estimates 



Q and all eigenvalues of matrix M. We adjusted for 
polygenic familial background based on the kinship 
coefficient matrix O kin =ZZ T using R-packages kinship2 
and coxme with R-function lmekin for genetic null 
model (2). We employed a linear kernel on allele 
dosages. This allows for additive genetic models on 
tested markers. The kernel matrix entry K^g^g, for 2 
subjects i, j is the scalar product of their subject-specific 
vectors g i; gj of allele dosages for N SNP SNP marker. 
SNPs were included for testing if their Hardy- Weinberg 
equilibrium test p values >10~ 5 (rounding imputed 
dosages for this purpose only). We contrasted test per- 
formance for the same subjects on 2 SNP panels, 
sequence (allele dosage data) versus GWAS (allele 
dosage data reduced to intersection with GWAS SNPs). 

Results 

Permutation testing demonstrated overall test validity: 
50,000 permutations of subject-genotype assignment 
(permuting across families) yielded type I error 0.049 ± 
0.0019, 0.00046 ± 0.00019 at significance level 0.05, 
0.0005 for simulated SBP on all sequenced rare MAP4 
SNPs. Simulated trait Ql was used to test the quality of 
polygenic adjustment. We analyzed Ql analogously to 
SBP on gene MAP4 and obtained p values <0.05 with 
rate 0.02 (all SNPs, and SNPs with minor allele fre- 
quency (MAF) >5%), rate 0.05 (for MAF 1% to 5%), and 
0.06 (for MAF <1%). This is within 95% tolerance limits 
(0.02, 0.08) for true value 0.05 on 200 replicates. 

For simulated SBP, associated gene MAP4 was detected 
with 100%, 62.5%, and 21.5% power at significance levels 
0.05, 10~ 4 , and 10~ 8 on 200 study replicates when subject- 
ing the MAP4 sequence to the global test. Power dimin- 
ished by only 0% to 2.5% when using the GWAS panel 
instead, or only MAF >5% SNPs. Figure 1 displays power 
to detect MAP4 association based on MAF <5% SNPs. We 
varied the significance level to compare test performance 
on the sequence (filled circles) with the considerably spar- 
ser GWAS panel (open circles). Sequenced data are 
powerful and outperform GWAS data particularly for very 
rare SNPs (MAF <1%) where GWAS had no power 
(Figure 1, right). GWAS outperformed sequence data on 
SNPs with MAF 1% to 5% (Figure 1, left) with just 11 tag- 
SNPs compared to 105 sequenced SNPs. 

Table 1 displays association of real SBP with the 5 
selected candidate genes on chromosome 3. P values do 
not withstand Bonferroni correction. Note, however, the 
difference between sequence (p = 0.150) and GWAS (p = 
0.031) for MECOM SNPs with MAF 1% to 5%. For 
AGTR1, sequence data tend to reduce p values compared 
to GWAS, as well as for SLC4A7 on rare variants. We 
partitioned the large MECOM, ULK4 sequences based on 
number of SNPs, as we expect that power improves by 
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Figure 1 Simulated SBP associates with gene MAP4 Power estimates of the kernel score statistic with 95% confidence limits over 200 
simulated study replicates as function of the significance level. The association between rank-normalized simulated systolic blood pressure and 
MAP4 was tested, adjusting for age, sex, age x sex interaction, and familial polygenic background. Polymorphic SNP numbers vary between 
study replicates for very rare SNPs with MAF <1% (range in legend, right panel), but not for MAF 1% to 5% (left panel). 



Table 1 Real SBP and candidate genes on chromosome 3 

Gene SNP panel Kernel score statistic (linear kernel on allele dosage data) 
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261 


0.124 
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0.776 
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Kernel score test over N SNP SNP marker for rank-normalized systolic blood pressure, adjusted for age, sex, age x sex interaction, and familial background (p <0.05 
in bold). 



changing from a situation where we test over many more 
SNPs compared to number of subjects (full gene) to a 
situation with fewer SNPs than subjects (gene parts). 

Conclusions 

The kernel score statistic is a global covariance compo- 
nent test. P values for genes are dominated by their 



common SNPs (see Table 1: all SNPs vs. MAF >5%). 
Thus, common and rare SNP sets should be tested sepa- 
rately. The kernel method offers a simple and flexible 
way to perform multimarker analysis and genetic interac- 
tion analyses by means of the kernel choice. It improves 
power compared to single-marker tests by exploiting 
SNP correlation [10], and if jointly tested SNP sets tag 
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multiple independent associated loci. The test statistic by 
Liu et al and Tzeng et al [2,3] had 17% power at the 
0.01 level for independent subjects to find MAP4 associa- 
tion with simulated SBP in 200 replicates. Our extension 
to families increased this to 98.5% power, used 84% 
more subjects with twice as many polymorphic SNPs 
(sequence), without type I error inflation. Ignoring famil- 
ial dependency would inflate type I error to 16%, 6% at 
nominal 0.05, 0.01 levels (applying the independent sub- 
jects test statistic [2,3] to families for MAP4 on polygenic 
trait Ql). We found that sequencing can improve test 
power appreciably, particularly for MAF <1% variants 
(Figure 1, right). For more frequent SNPs, the compari- 
son was not decisive and tag-SNPs proved to be strong 
competitors. With sequencing, even a single, large gene 
may provide more SNP variants than available study sub- 
jects. The kernel score test remains valid because of the 
Bayesian nature of the underlying nonparametric genetic 
model [4]. However, one might expect power loss. The 
kernel score statistic may be susceptible to random con- 
founders. For example, simplistic adjustment of familial 
and common cultural effects by a familial random inter- 
cepts model (instead of the polygenic model) yields valid 
permutation test results, but shows up as inflated type I 
errors on simulated polygenic trait Ql (p <0.05 on 10% 
of 200 study replicates for all MAP4 SNPs, inflation 
increases with decreasing MAF). This, in turn, would 
yield similar but overly optimistic estimates for Figure 1 
and Table 1. With such a liberal test, AGTR1 p values in 
Table 1 would all drop below 0.05 for all sequence SNP 
sets. We would like to emphasize that other random con- 
founder effects, such as different trait variability between 
study subgroups, can be adjusted in exact analogy to the 
familial polygenic background adjustment. 
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